March 14, 2003 

3.2 Likelihood equations and errors-in-variables regression; Solari's example.

Here is a case, noted by Solari (1969), where the likelihood equations (3.1.1) have 
solutions, none of which are maximum likelihood estimates. 

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be observed points in the plane. Suppose we want to do
a form of "errors in variables" regression, in other words to fit the data by a straight
line, assuming normal errors in both variables, so that $X_i = a_i + U_i$ and $Y_i = b a_i + V_i$
where $U_1, \dots, U_n$ and $V_1, \dots, V_n$ are all jointly independent, with $U_i$ having distribution
$N(0, \sigma^2)$ and $V_i$ distribution $N(0, \tau^2)$ for $i = 1, \dots, n$. Here the unknown parameters are
$a_1, \dots, a_n$, $b$, $\sigma^2$ and $\tau^2$. Let $c := \sigma^2$ and $h := \tau^2$. Then the joint density is

$(ch)^{-n/2} (2\pi)^{-n} \exp\Bigl(-\sum_{i=1}^n \bigl[(X_i - a_i)^2/(2c) + (Y_i - b a_i)^2/(2h)\bigr]\Bigr).$

Let $\sum := \sum_{i=1}^n$. Taking logarithms, the likelihood equations are equivalent to the
vanishing of the gradient (with respect to all $n + 2$ parameters $a_1, \dots, a_n, b, c, h$) of

$-(n/2) \log(ch) - \sum \bigl[(X_i - a_i)^2/(2c) + (Y_i - b a_i)^2/(2h)\bigr].$

Taking derivatives with respect to c and h gives 

(3.2.1)   $c = \sum (X_i - a_i)^2/n, \qquad h = \sum (Y_i - b a_i)^2/n.$
For each $i = 1, \dots, n$, $\partial/\partial a_i$ gives

(3.2.2)   $0 = c^{-1}(X_i - a_i) + h^{-1} b (Y_i - b a_i),$
and $\partial/\partial b$ gives

(3.2.3)   $\sum a_i (Y_i - b a_i) = 0.$
Next, (3.2.2) implies 

(3.2.4)   $\sum (X_i - a_i)^2/c^2 = b^2 \sum (Y_i - b a_i)^2/h^2.$

From (3.2.1) it then follows that $1/c = b^2/h$ and $b \ne 0$. This and (3.2.2) imply
$b(X_i - a_i) = -(Y_i - b a_i)$, so,

(3.2.5)   $a_i = (X_i + b^{-1} Y_i)/2$ for $i = 1, \dots, n$.
Then from (3.2.1) again,

(3.2.6)   $c = \sum (X_i - b^{-1} Y_i)^2/(4n)$ and $h = \sum (Y_i - b X_i)^2/(4n)$.
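Using (3.2.5) in (3.2.3) gives

$0 = \sum (Y_i - b X_i)(X_i + b^{-1} Y_i) = b^{-1} \sum (Y_i^2 - b^2 X_i^2).$

If $\sum Y_i^2 > 0 = \sum X_i^2$ there is no solution for $b$. Also if $\sum Y_i^2 = 0 < \sum X_i^2$ we would get
$b = 0$, a contradiction, so there is no solution in this case. If $\sum X_i^2 = \sum Y_i^2 = 0$ then
(3.2.3) gives $a_i = 0$ for all $i$ since $b \ne 0$, but then (3.2.1) gives $c = 0$, a contradiction,
so there is no solution in this case either. It remains to consider the case
$\sum X_i^2 > 0 < \sum Y_i^2$. Then

(3.2.7)   $b = \pm \bigl(\sum Y_i^2 / \sum X_i^2\bigr)^{1/2}.$
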
Substituting each of these two possible values of $b$ in (3.2.5) and (3.2.6) then determines
values of all the other parameters, giving two points (or one point if all the $Y_i$ are 0) where
the likelihood equations hold, in other words critical points of the likelihood function, and
there are no other critical points.
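
As a numerical check of the derivation, the following Python sketch computes the candidate
critical points from (3.2.5) and (3.2.6), taking $b = \pm(\sum Y_i^2/\sum X_i^2)^{1/2}$, and
evaluates the gradient of the log likelihood there. The function names and the sample data are
hypothetical, chosen only for illustration; NumPy is assumed to be available. Under these
assumptions the gradient should vanish up to rounding error.

```python
import numpy as np

def solari_critical_points(X, Y):
    """Candidate critical points (b, a, c, h) from (3.2.5) and (3.2.6)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(X)
    sx, sy = np.sum(X**2), np.sum(Y**2)
    if sx == 0 or sy == 0:
        return []  # the likelihood equations have no solution in these cases
    points = []
    for sign in (1.0, -1.0):
        b = sign * np.sqrt(sy / sx)              # b^2 = sum Y_i^2 / sum X_i^2
        a = (X + Y / b) / 2                      # (3.2.5)
        c = np.sum((X - Y / b) ** 2) / (4 * n)   # (3.2.6)
        h = np.sum((Y - b * X) ** 2) / (4 * n)   # (3.2.6)
        if c > 0 and h > 0:                      # both variances must be positive
            points.append((b, a, c, h))
    return points

def log_likelihood_gradient(X, Y, b, a, c, h):
    """Gradient of -(n/2) log(ch) - sum[(X-a)^2/(2c) + (Y-ba)^2/(2h)],
    ordered as (d/dc, d/dh, d/db, d/da_1, ..., d/da_n)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    n = len(X)
    rx, ry = X - a, Y - b * a
    d_c = -n / (2 * c) + np.sum(rx**2) / (2 * c**2)
    d_h = -n / (2 * h) + np.sum(ry**2) / (2 * h**2)
    d_b = np.sum(a * ry) / h
    d_a = rx / c + b * ry / h
    return np.concatenate(([d_c, d_h, d_b], d_a))

# Illustrative data only (not from the text).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 1.9, 3.2, 3.9])
for b, a, c, h in solari_critical_points(X, Y):
    grad = log_likelihood_gradient(X, Y, b, a, c, h)
    print(f"b = {b:+.4f}, largest |gradient component| = {np.max(np.abs(grad)):.2e}")
```

The check only confirms that these points are critical; it says nothing about whether they
are maxima.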

Note however that the joint density goes to $+\infty$ for $a_i = X_i$, fixed $b$ and $h$, and $c \downarrow 0$.
Thus the above two points cannot give an absolute maximum of the likelihood. On the
other hand, as $c \downarrow 0$ for any fixed $a_i \ne X_i$, the likelihood approaches 0. So the likelihood
behaves pathologically in the neighborhood of points where $a_i = X_i$ for all $i$ and $c = 0$,
its logarithm having what is called an essential singularity. Other such singularities occur
where $Y_i - b a_i \to 0$ and $h \downarrow 0$. Here in the parametrization for the natural parameter
space, some $\theta_j$ have $\sigma^2$ in the denominator, so that as $\sigma \downarrow 0$, these natural parameters go
to $\pm\infty$, where singular behavior is not so surprising.
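
The two limits can be seen numerically. The sketch below is only an illustration under
stated assumptions: the data, the fixed values $b = h = 1$ and the offset 0.1 are arbitrary
choices, and `log_likelihood` is a hypothetical helper that omits the additive constant
$-n\log(2\pi)$.

```python
import numpy as np

def log_likelihood(X, Y, b, a, c, h):
    """Log of the joint density, up to the additive constant -n log(2 pi)."""
    n = len(X)
    return (-(n / 2) * np.log(c * h)
            - np.sum((X - a) ** 2) / (2 * c)
            - np.sum((Y - b * a) ** 2) / (2 * h))

# Illustrative data and fixed parameters only (not from the text).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 1.9, 3.2, 3.9])
b, h = 1.0, 1.0

for c in (1e-2, 1e-4, 1e-8):
    on_data = log_likelihood(X, Y, b, X, c, h)          # a_i = X_i for all i
    off_data = log_likelihood(X, Y, b, X + 0.1, c, h)   # a_i != X_i
    print(f"c = {c:.0e}: a = X gives {on_data:.1f}, a = X + 0.1 gives {off_data:.1f}")
```

With $a_i = X_i$ the printed values grow without bound as $c$ decreases, while with
$a_i \ne X_i$ they decrease without bound, so the likelihood itself tends to 0.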

Note that the family of densities in the example is exponential, as defined in Sec.
2.5, but that it has no maxima for its likelihoods: this is consistent with Theorem 3.2.2
which only guarantees uniqueness of maximum likelihood estimates when they exist, not
existence. Also, the family parametrized by $b, a_1, \dots, a_n, \sigma^2, \tau^2$ is not the full natural
parameter space, but a curved submanifold in it. For example if $n = 2$, $\theta_i$ are the coefficients
of $X_i$ and $\theta_{i+2}$ those of $Y_i$ for $i = 1, 2$, we have $\theta_1 \theta_4 = \theta_2 \theta_3$, so that even if an MLE
existed in the family, it would not be guaranteed to be unique.

Having noted that maximum likelihood estimation doesn't work in the
formulation given above, let's consider some other formulations.

Let $Q_\eta$, $\eta \in Y$ be a family of probability laws on a sample space $X$ where $Y$ is a
parameter space. The function $\eta \mapsto Q_\eta$ is called identifiable if it's one-to-one, i.e. $Q_\eta \ne Q_\xi$
for $\eta \ne \xi$ in $Y$. If $\eta$ is a vector, $\eta = (\eta_1, \dots, \eta_k)$, a component parameter $\eta_j$ will be called
identifiable if laws $Q_\eta$ with different values of $\eta_j$ are always distinct. Thus, $\eta \mapsto Q_\eta$ is
identifiable if and only if each component $\eta_1, \dots, \eta_k$ is identifiable. Suppose $\theta \mapsto P_\theta$ for
$\theta \in \Theta$ is identifiable and $Q_\eta = P_{\theta(\eta)}$ for some function $\theta(\cdot)$ on $Y$. Then $\eta \mapsto Q_\eta$ is
identifiable if and only if $\theta(\cdot)$ is one-to-one.

Example. Let $dQ_\eta(\psi) = a e^{\cos(\psi - \eta)}\,d\psi$ for $0 \le \psi < 2\pi$ where $a$ is the suitable constant, a
subfamily of the von Mises-Fisher family. Then $\eta \mapsto Q_\eta$ is not identifiable for $\eta \in Y = \mathbb{R}$,
but it is for $Y = [0, 2\pi)$ or $Y = [-\pi, \pi)$.
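
A small numerical illustration of this example follows. It is a sketch under stated
assumptions: the normalizing constant is taken to be $a = 1/(2\pi I_0(1))$, where $I_0$ is
the modified Bessel function of order 0 (available in NumPy as `np.i0`), and the function
name `density` and the values of $\eta$ are hypothetical choices.

```python
import numpy as np

def density(psi, eta):
    """Density a * exp(cos(psi - eta)) on [0, 2 pi), with a = 1 / (2 pi I_0(1))."""
    return np.exp(np.cos(psi - eta)) / (2 * np.pi * np.i0(1.0))

# Equally spaced grid on [0, 2 pi); the endpoint 2 pi is excluded.
psi = np.linspace(0.0, 2 * np.pi, 1000, endpoint=False)

# eta and eta + 2 pi give the same law, so eta is not identifiable on the real line.
same = np.allclose(density(psi, 0.3), density(psi, 0.3 + 2 * np.pi))

# Two distinct values of eta in [0, 2 pi) give different densities.
different = not np.allclose(density(psi, 0.3), density(psi, 1.0))

# Riemann sum over one period as a check on the normalizing constant.
total_mass = density(psi, 0.3).mean() * 2 * np.pi

print(same, different, round(float(total_mass), 6))
```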

Now consider another form of errors-in-variables regression, where for $i = 1, \dots, n$,
$X_i = x_i + U_i$, $Y_i = a + b x_i + V_i$, $U_1, \dots, U_n$ are i.i.d. $N(0, \sigma^2)$ and independent of $V_1, \dots, V_n$
i.i.d. $N(0, \tau^2)$, all independent of $x_1, \dots, x_n$ i.i.d. $N(\mu, \zeta^2)$ where $a, b, \mu \in \mathbb{R}$ and $\sigma^2 > 0$,
$\tau^2 > 0$ and $\zeta^2 > 0$. This differs from the formulation in the Solari example in that the
$x_i$, now random variables, were parameters in the example. In the present model, only
the variables $(X_i, Y_i)$ for $i = 1, \dots, n$ are observed and we want to estimate the parameters.
Clearly the $(X_i, Y_i)$ are i.i.d. and have a bivariate normal distribution. The means are
$EX_i = \mu$ and $EY_i = \nu := a + b\mu$. The parameters $\mu$ and $\nu$ are always identifiable.
It is easily checked that $C_{11} := \mathrm{var}(X_i) = \zeta^2 + \sigma^2$, $C_{22} := \mathrm{var}(Y_i) = b^2\zeta^2 + \tau^2$,
and $C_{12} = C_{21} := \mathrm{cov}(X_i, Y_i) = b\zeta^2$. A bivariate normal distribution is given by 5
real parameters, in this case $C_{11}, C_{12}, C_{22}, \mu, \nu$. A continuous function (with polynomial
components in this case) from an open set in $\mathbb{R}^6$ onto an open set in $\mathbb{R}^5$ can't be
one-to-one (Appendix B), so the 6 parameters $a, b, \mu, \zeta^2, \sigma^2, \tau^2$ are not all identifiable. Let's
examine the situation more closely. The equation $\nu = a + b\mu$ can be solved for $a$. The
equation $C_{11} = \zeta^2 + \sigma^2$ can be solved for $\sigma^2 > 0$ if and only if $\zeta^2 < C_{11}$. The equation
$C_{22} = b^2\zeta^2 + \tau^2$ can be solved for $\tau^2 > 0$ if and only if $b^2\zeta^2 < C_{22}$. The equation $C_{12} = b\zeta^2$
can be solved for $\zeta^2 > 0$ if and only if $C_{12}$ and $b$ have the same sign (both are positive,
both are negative or both are 0).

If $C_{12} > 0$, all three equations for $C_{ij}$, $1 \le i \le j \le 2$, can be solved whenever
$C_{12}/C_{11} < b < C_{22}/C_{12}$. Specifically, then $bC_{12} > C_{12}^2/C_{11} > 0$, so $b > 0$. Then
$\zeta^2 = C_{12}/b$, $b^2\zeta^2 = bC_{12} < C_{22}$, and $\zeta^2 = C_{12}/b < C_{11}$ as desired. Thus $b$ can range over
a non-empty open interval if $C_{12}^2 < C_{11}C_{22}$, in other words $\det C > 0$, which is true when
the covariance matrix $C$ is positive definite (non-singular). Thus $b$ is not identifiable: we
can get the same bivariate normal distribution for different values of $b$, and some other
parameters will also differ.

In the exceptional case $C_{12} = 0$, since $\zeta^2 > 0$ we get $b = 0$ so it is identifiable, as are
$\tau^2$ and $a$, while $\zeta^2$ and $\sigma^2$ are not.

If $C_{12} < 0$, we will have $b < 0$. Again we get a non-empty open interval of possible
values of $b$, $C_{12}/C_{11} > b > C_{22}/C_{12}$ for $\det C > 0$, so $b$ is not identifiable.
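
The non-identifiability of $b$ can be made concrete with a short Python sketch. The
function names and the numerical values are hypothetical, chosen only for illustration:
for one fixed covariance matrix with $C_{12} > 0$, two different slopes in the admissible
interval each yield valid variances $\zeta^2, \sigma^2, \tau^2$ reproducing the same
$C_{11}, C_{12}, C_{22}$.

```python
def params_from_b(C11, C12, C22, b):
    """Solve C12 = b zeta^2, C11 = zeta^2 + sigma^2, C22 = b^2 zeta^2 + tau^2
    for (zeta^2, sigma^2, tau^2), given a slope b with the same sign as C12."""
    zeta2 = C12 / b
    sigma2 = C11 - zeta2
    tau2 = C22 - b * C12          # b^2 zeta^2 = b C12
    if min(zeta2, sigma2, tau2) <= 0:
        raise ValueError("b is outside the admissible open interval")
    return zeta2, sigma2, tau2

def implied_covariances(b, zeta2, sigma2, tau2):
    """Return (C11, C12, C22) implied by the model parameters."""
    return zeta2 + sigma2, b * zeta2, b * b * zeta2 + tau2

# Illustrative values only: here the admissible interval is 0.5 < b < 3.
C11, C12, C22 = 2.0, 1.0, 3.0
for b in (1.0, 2.0):
    zeta2, sigma2, tau2 = params_from_b(C11, C12, C22, b)
    print(b, (zeta2, sigma2, tau2), implied_covariances(b, zeta2, sigma2, tau2))
```

Both slopes give back the same covariances, so the observed bivariate normal law cannot
distinguish between them. The same functions apply when $C_{12} < 0$ and $b < 0$.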

If the original problem is changed so that $a = 0$ and we are looking for a line through
the origin, $b = \nu/\mu$ is identifiable unless $\mu = 0$. If $\mu = 0$ then we have non-identifiability
much as in the previous cases.

If, instead, we change the problem so that $\lambda := \tau^2/\sigma^2 > 0$ is assumed known, then all
the 5 remaining parameters are identifiable (unless $\lambda C_{11} = C_{22}$), as follows. The equation
for $C_{22}$ now becomes $C_{22} = b^2\zeta^2 + \sigma^2\lambda$. After some algebra, we get an equation quadratic
in $b$,

(3.2.8)   $b^2 C_{12} + (\lambda C_{11} - C_{22}) b - \lambda C_{12} = 0.$

Since $\lambda > 0$, the equation always has real solutions. If $C_{12} = 0$ the equation becomes linear
and either $b = 0$ is the only solution or if $\lambda C_{11} - C_{22} = 0$, $b$ can have any value and in this
special case is not identifiable. If $C_{12} \ne 0$ there are two distinct real roots for $b$, of opposite
signs. Since $b$ must be of the same sign as $C_{12}$ to satisfy the original equations, there is a
unique solution for $b$, and one can solve for all the parameters, so they are identifiable in
this case.
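
The root selection just described can be sketched in Python as follows. The function name
and the numerical values are hypothetical; the example builds covariances from known
parameters ($b = 2$, $\zeta^2 = 1$, $\sigma^2 = 0.5$, $\lambda = 3$, so $\tau^2 = 1.5$) and
checks that the quadratic (3.2.8) returns the slope that was put in.

```python
import math

def slope_from_covariances(C11, C12, C22, lam):
    """Root of b^2 C12 + (lam C11 - C22) b - lam C12 = 0 with the sign of C12.

    Returns None in the special case C12 = 0 and lam C11 = C22, where the
    equation places no restriction on b.
    """
    B = lam * C11 - C22
    if C12 == 0:
        return None if B == 0 else 0.0
    # The discriminant B^2 + 4 lam C12^2 is positive, and the product of the
    # roots is -lam < 0, so the two real roots have opposite signs.
    root = math.sqrt(B * B + 4 * lam * C12 * C12)
    # Since root > |B|, the numerator is positive and b has the sign of C12.
    return (-B + root) / (2 * C12)

# Illustrative parameters only (not from the text).
b, zeta2, sigma2, lam = 2.0, 1.0, 0.5, 3.0
C11 = zeta2 + sigma2                      # 1.5
C12 = b * zeta2                           # 2.0
C22 = b * b * zeta2 + lam * sigma2        # 5.5

b_rec = slope_from_covariances(C11, C12, C22, lam)
zeta2_rec = C12 / b_rec                   # from C12 = b zeta^2
sigma2_rec = C11 - zeta2_rec              # from C11 = zeta^2 + sigma^2
tau2_rec = lam * sigma2_rec               # from lam = tau^2 / sigma^2
print(b_rec, zeta2_rec, sigma2_rec, tau2_rec)
```

The exact floating-point comparisons with 0 are acceptable for a sketch; with estimated
covariances a tolerance would be more appropriate.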

Now, let's consider how the parameters can be estimated, still in case $\lambda$ is known.
Suppose given a normal distribution $N(\mu, C)$ on $\mathbb{R}^k$ where now $\mu = (\mu_1, \dots, \mu_k)$,
$C := \{C_{ij}\}_{i,j=1}^k$ and we have $n$ i.i.d. observations $X_1, \dots, X_n \in \mathbb{R}^k$ with $X_r := \{X_{ri}\}_{i=1}^k$.
It's easily seen that the MLE of $\mu$ is $\bar X := \{\bar X_i\}_{i=1}^k$ where $\bar X_i := \frac{1}{n}\sum_{r=1}^n X_{ri}$ for
$i = 1, \dots, k$.

3.2.9 Theorem. The MLE of the covariance matrix $C$ of a multivariate normal distribution
is the sample covariance matrix

$\hat C_{ij} := \frac{1}{n} \sum_{r=1}^n (X_{ri} - \bar X_i)(X_{rj} - \bar X_j)$

for $i, j = 1, \dots, k$.

Proof. For $k = 1$, this says that the MLE of the variance $\sigma^2$ for a normal distribution is
$\frac{1}{n}\sum_{r=1}^n (X_r - \bar X)^2$. (As noted above, this is $(n - 1)/n$ times the usual, unbiased estimator
of the variance.) This is easily checked, substituting in the MLE $\bar X$ of the mean, then
finding that the likelihood equation for $\sigma$ has a unique solution which is easily seen to give
a maximum.

Now in $k$ dimensions, consider any linear function $f$ from $\mathbb{R}^k$ into $\mathbb{R}$ such as a
coordinate, $f(X_r) = X_{ri}$. Let $\bar f := \frac{1}{n}\sum_{r=1}^n f(X_r)$. Then by the one-dimensional facts, $\bar f$ is the
MLE of the mean of $f$ and the MLE of $\mathrm{var}(f)$ is its sample variance $\frac{1}{n}\sum_{r=1}^n (f(X_r) - \bar f)^2$.

For any function $g$ on a parameter space, if an MLE $\hat\theta$ of the parameter $\theta$ exists, then
the MLE of $g(\theta)$ is (by definition) $g(\hat\theta)$. In this sense maximum likelihood estimation is
preserved by general functions, unlike unbiased estimation which is not generally preserved
except by linear functions. At any rate, we see that the MLE of $C_{jj}$ is $\hat C_{jj}$ for $j = 1, \dots, k$.
Moreover, the MLE of $\mathrm{var}(X_{1i} + X_{1j})$ is also its sample variance for any $i, j = 1, \dots, k$.
Subtracting the sample variances of $X_{1i}$ and $X_{1j}$ and dividing by 2, we get the sample
covariance of $X_{1i}$ and $X_{1j}$, namely $\hat C_{ij}$. Since the MLE of a sum or difference of functions
of the parameters is the sum or difference of the MLEs, we get that the sample covariance
$\hat C_{ij}$ is indeed the MLE of $C_{ij}$. □
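
The polarization step in the proof can be checked numerically. In the sketch below the
function names and the simulated data are hypothetical; it computes the sample covariance
matrix with divisor $n$ directly and also recovers one off-diagonal entry from sample
variances alone, comparing both with NumPy's `np.cov(..., bias=True)`, which also divides
by $n$.

```python
import numpy as np

def mle_covariance(X):
    """Sample covariance matrix with divisor n, for an (n, k) data array X."""
    X = np.asarray(X, float)
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

def sample_variance(v):
    """Sample variance with divisor n (the MLE in the one-dimensional case)."""
    v = np.asarray(v, float)
    return np.mean((v - v.mean()) ** 2)

def covariance_by_polarization(X, i, j):
    """(var(X_i + X_j) - var(X_i) - var(X_j)) / 2, using sample variances only."""
    X = np.asarray(X, float)
    return (sample_variance(X[:, i] + X[:, j])
            - sample_variance(X[:, i])
            - sample_variance(X[:, j])) / 2

# Simulated data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

C_hat = mle_covariance(X)
print(np.allclose(C_hat, np.cov(X, rowvar=False, bias=True)))
print(np.isclose(C_hat[0, 2], covariance_by_polarization(X, 0, 2)))
```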

Returning to the bivariate case and continuing with errors-in-variables regression for
a fixed $\lambda$, by a change of scale we can assume $\lambda = 1$, in other words $\sigma^2 = \tau^2$. Since $\bar X$ and
$\bar Y$ are the MLEs of the means, we see that $\bar Y = \hat\nu = \hat a + \hat b\hat\mu = \hat a + \hat b\bar X$, so the MLE regression
line will pass through $(\bar X, \bar Y)$, as it also does for the classical regression lines of $y$ on $x$ or
$x$ on $y$.

For a line $\ell$ and a point $(X, Y) \in \mathbb{R}^2$ the Euclidean distance from $(X, Y)$ to $\ell$ is

$d((X, Y), \ell) := \min(|(X, Y) - (\xi, \eta)| : (\xi, \eta) \in \ell).$

Given $(X_1, Y_1), \dots, (X_n, Y_n)$, a line $\ell : y = a + bx$ or a vertical line where $x$ is constant
will be called a best-fit-by-squared-distance (bfsd) regression line if it minimizes

$\sum_{j=1}^n d((X_j, Y_j), \ell)^2.$

3.2.10 Proposition. For errors-in-variables regression with $\sigma^2 = \tau^2 > 0$ ($\lambda = 1$), MLE
lines are the same as bfsd regression lines, which always exist. The bfsd line is unique
unless $\hat C_{11} = \hat C_{22} = 0$ or $\hat C_{11} = \hat C_{22} > 0$ and $\hat C_{12} = 0$; in case of non-uniqueness, the bfsd
lines are all lines through $(\bar X, \bar Y)$.

Proof. For a more thorough exposition on bfsd lines, see Appendix E, where they are
called bfd lines. Here we will concentrate on the cases where uniqueness holds.
The slope of the line joining a point $(X, Y)$ to its nearest point $(\xi, \eta)$ in a line $\ell : y =
a + bx$ is $-1/b$, and a little algebra or trigonometry gives

$d((X, Y), \ell)^2 = (Y - a - bX)^2/(b^2 + 1).$
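
The distance formula can be verified on a small example. The sketch below uses
hypothetical function names and an arbitrary point and line; it finds the foot of the
perpendicular explicitly and compares the squared Euclidean distance with the closed form.

```python
def nearest_point_on_line(X, Y, a, b):
    """Point (xi, eta) on y = a + b x closest to (X, Y)."""
    # Minimizing (X - xi)^2 + (Y - a - b xi)^2 over xi gives this value.
    xi = (X + b * (Y - a)) / (b * b + 1)
    return xi, a + b * xi

def squared_distance_formula(X, Y, a, b):
    """Closed form (Y - a - b X)^2 / (b^2 + 1)."""
    return (Y - a - b * X) ** 2 / (b * b + 1)

# Illustrative point and line only: (X, Y) = (3, 1) and y = 1 + 2x.
X, Y, a, b = 3.0, 1.0, 1.0, 2.0
xi, eta = nearest_point_on_line(X, Y, a, b)
direct = (X - xi) ** 2 + (Y - eta) ** 2
slope_to_foot = (eta - Y) / (xi - X)      # should equal -1/b
print(direct, squared_distance_formula(X, Y, a, b), slope_to_foot)
```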

Thus for bfsd regression we want to find

$\inf_{a,b} \sum_{i=1}^n (a + bX_i - Y_i)^2/(b^2 + 1).$

Setting $\partial/\partial a = 0$ gives a unique solution $a = \bar Y - b\bar X$, which is easily seen to give a
minimum with respect to $a$ since the quantity being minimized goes to $\infty$ as $|a| \to \infty$. So
the bfsd regression line, like the others, will pass through $(\bar X, \bar Y)$. Plugging in the value for
$a$, we now want to find

(3.2.11)   $\inf_b \sum_{i=1}^n [b(X_i - \bar X) - (Y_i - \bar Y)]^2/(b^2 + 1).$
Taking $\partial/\partial b$ gives a quadratic equation in $b$,

(3.2.12)   $0 = (b^2 - 1)\hat C_{12} + b(\hat C_{11} - \hat C_{22}).$

This equation is the same as (3.2.8) with $\lambda = 1$ and with $C_{ij}$ replaced by their MLEs
$\hat C_{ij}$. Also, $b$ is a solution of (3.2.12) if and only if $-1/b$ is. If $b\hat C_{12} < 0$, then the sum
in (3.2.11) becomes smaller if we replace $b$ by $-b$. Thus the infimum can be restricted to
values of $b$ having the same sign as $\hat C_{12}$. As $b \to \pm\infty$, the quantity being minimized in
(3.2.11) approaches $\sum_{i=1}^n (X_i - \bar X)^2$. If the latter is 0 we do get an infimum, so a vertical
line through $(\bar X, \bar Y)$ is a bfsd line. Or, if the $Y_i$ are all equal, $b = 0$ yields a minimum equal
to 0 and the horizontal line through $(\bar X, \bar Y)$ is a bfsd line. If all the points $(X_i, Y_i)$ are
equal, the bfsd lines are all lines through $(\bar X, \bar Y)$. For further cases of non-uniqueness see
Appendix E. In the cases of uniqueness, the coefficients of our quadratic polynomial in $b$
are non-zero. So, bfsd regression does give the same line as maximum likelihood estimation
for errors-in-variables regression with $\lambda = 1$. □
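
For readers who want to experiment, here is a Python sketch of the bfsd slope computed
from the quadratic (3.2.12), taking the root with the sign of $\hat C_{12}$. The function
names and the data are hypothetical, and the sketch only covers the case
$\hat C_{12} \ne 0$. As a cross-check it uses the standard fact that the line minimizing
summed squared perpendicular distances points along the eigenvector of the sample
covariance matrix with the largest eigenvalue.

```python
import numpy as np

def bfsd_slope(X, Y):
    """Root of (b^2 - 1) C12 + b (C11 - C22) = 0 with the same sign as C12."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    dx, dy = X - X.mean(), Y - Y.mean()
    C11, C12, C22 = np.mean(dx * dx), np.mean(dx * dy), np.mean(dy * dy)
    if C12 == 0:
        raise ValueError("this sketch assumes a non-zero sample covariance")
    B = C11 - C22
    return (-B + np.sqrt(B * B + 4 * C12 * C12)) / (2 * C12)

def mean_squared_distance(X, Y, b):
    """(1/n) times the sum in (3.2.11), with a = mean(Y) - b mean(X)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    a = Y.mean() - b * X.mean()
    return np.mean((a + b * X - Y) ** 2) / (b * b + 1)

# Illustrative data only (not from the text).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

b = bfsd_slope(X, Y)
a = Y.mean() - b * X.mean()

# Cross-check: slope of the principal eigenvector of the sample covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, Y, bias=True))
v = eigenvectors[:, -1]                   # eigh returns eigenvalues in ascending order
print(a, b, v[1] / v[0])
```

Evaluating `mean_squared_distance` at slopes slightly above and below `b` gives a quick
check that the chosen root is the minimizing one rather than the other root $-1/b$.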

PROBLEM 

1. In errors-in-variables regression with $\lambda = \tau^2/\sigma^2 = 1$, suppose we observe $\hat C_{11} = 1.44$,
$\hat C_{12} = 0.66$, and $\hat C_{22} = 1.21$. Find the MLE of the slope $b$ of the regression line (i.e. the
bfsd line).

NOTES 

Solari (1969) showed that the critical point at which the likelihood is larger in her 
example is actually a saddle point, and so not even a local maximum of the likelihood. 
Some earlier authors had thought it was a maximum.